Multimodal Comparable Corpora as Resources for Extracting Parallel Data: Parallel Phrases Extraction

Authors

  • Haithem Afli
  • Loïc Barrault
  • Holger Schwenk
Abstract

Discovering parallel data in comparable corpora is a promising approach for overcoming the lack of parallel texts in statistical machine translation and other NLP applications. In this paper we propose an alternative to comparable corpora of texts as resources for extracting parallel data: a multimodal comparable corpus of audio and texts. We present a novel method to detect parallel phrases in such corpora, based on splitting comparable sentences into fragments, called phrases. The audio is transcribed by an automatic speech recognition system, split into fragments, and translated with a baseline statistical machine translation system. We then use information retrieval in a large text corpus in the target language, also split into fragments, and extract parallel phrases. We compare our method with parallel sentence extraction techniques. We evaluate the quality of the extracted data on an English to French translation task and show significant improvements over a state-of-the-art baseline.
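The pipeline described in the abstract (fragment splitting, translation, retrieval in the target-language corpus, extraction) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and not the authors' implementation: the function names are hypothetical, a word-by-word lexicon lookup stands in for the baseline SMT system, bag-of-words cosine similarity stands in for the information retrieval step, the fixed fragment size and the threshold are arbitrary, and the speech recognition step is assumed to have already produced the source sentences.

```python
import math
from collections import Counter


def split_into_fragments(sentence, size=3):
    """Split a sentence into non-overlapping fragments of at most `size` words."""
    words = sentence.lower().split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def toy_translate(fragment, lexicon):
    """Word-by-word lookup; a hypothetical stand-in for a baseline SMT system."""
    return " ".join(lexicon.get(word, word) for word in fragment.split())


def cosine(a, b):
    """Bag-of-words cosine similarity; a stand-in for an IR ranking score."""
    ca, cb = Counter(a.split()), Counter(b.split())
    dot = sum(ca[w] * cb[w] for w in ca)
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(
        sum(v * v for v in cb.values())
    )
    return dot / norm if norm else 0.0


def extract_parallel_phrases(source_sentences, target_sentences, lexicon,
                             threshold=0.8, size=3):
    """Pair each source fragment with its best-matching target fragment.

    A pair is kept only if the similarity between the translated source
    fragment and the retrieved target fragment reaches `threshold`
    (an assumed filtering criterion).
    """
    target_fragments = [
        frag for sent in target_sentences
        for frag in split_into_fragments(sent, size)
    ]
    pairs = []
    for sentence in source_sentences:
        for fragment in split_into_fragments(sentence, size):
            query = toy_translate(fragment, lexicon)
            best = max(target_fragments,
                       key=lambda t: cosine(query, t), default=None)
            if best is not None and cosine(query, best) >= threshold:
                pairs.append((fragment, best))
    return pairs
```

For example, with the toy lexicon `{"the": "le", "cat": "chat", "sleeps": "dort"}`, the transcribed sentence "the cat sleeps" and the target sentence "le chat dort bien ici" yield the single pair `("the cat sleeps", "le chat dort")`. A real system would replace each stand-in with the corresponding component named in the abstract.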


Similar resources

Extracting Parallel Corpora from Comparable Documents to Improve Translation Quality in Machine Translation Systems

Data used for training statistical machine translation systems are usually prepared from three resources: parallel, non-parallel and comparable text corpora. Parallel corpora are an ideal resource for translation, but because such texts are scarce, non-parallel and comparable corpora are also used for parallel text extraction. Most existing methods for exploiting comparable corpora loo...


Generative Models of Noisy Translations with Applications to Parallel Fragment Extraction

The development of broad domain statistical machine translation systems is gated by the availability of parallel data. A promising strategy for mitigating data scarcity is to mine parallel data from comparable corpora. Although comparable corpora seldom contain parallel sentences, they often contain parallel words or phrases. Recent fragment extraction approaches have shown that including paral...


Automatic Bilingual Phrase Extraction from Comparable Corpora

In this work we present an approach for extracting parallel phrases from comparable news articles to improve statistical machine translation. This is particularly useful for under-resourced languages where parallel corpora are not readily available. Our approach consists of a phrase pair generator that automatically generates candidate parallel phrases and a binary SVM classifier that classifie...


Parallel Texts Extraction from Multimodal Comparable Corpora

Statistical machine translation (SMT) systems depend on the availability of domain-specific bilingual parallel text. However, parallel corpora are a limited resource and are often not available for some domains or language pairs. We analyze the feasibility of extracting parallel sentences from multimodal comparable corpora. This work extends the use of comparable corpora by using audio sour...


Extracting Parallel Phrases from Comparable Data

Mining parallel data from comparable corpora is a promising approach for overcoming data sparseness in statistical machine translation and other NLP applications. Even if two comparable documents have few or no parallel sentence pairs, there is still potential for parallelism at the sub-sentential level. The ability to detect these phrases creates a valuable resource, especially for low-res...



Publication date: 2013